This is a quick walkthrough demonstrating how to generate SWNE plots alongside the Seurat pipeline using a 3k PBMC dataset as an example.

To save time we will be using the pre-computed Seurat object pbmc3k_seurat.Robj, which can be downloaded here.

First let’s load the required libraries

library(Seurat)
library(swne)

Next let’s load the Seurat object

se.obj <- readRDS("~/swne/Data/pbmc3k_seurat.Robj")

Most scRNA-seq pipelines only use a subset of highly overdispersed genes for analysis. We’ll pull out those variable genes here, as well as the cluster labels.

## Pull out overdispersed genes as defined by Seurat
var.genes <- se.obj@var.genes
length(var.genes)
## [1] 1838
## Pull out cell clusters as defined by Seurat
cell.clusters <- se.obj@ident; names(cell.clusters) <- se.obj@cell.names;
levels(cell.clusters)
## [1] "B cells"           "CD4 T cells"       "CD8 T cells"      
## [4] "CD14+ Monocytes"   "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"

We’ll use Seurat to build an Shared Nearest Neighbors network (SNN) to smooth the cell embeddings with later

se.obj <- BuildSNN(se.obj, dims.use = 1:20, k.param = 20, prune.SNN = 1/20)

The easiest way to generate an SWNE embedding is to use the wrapper function RunSWNE

## Run SWNE
genes.embed <- c("MS4A1", "GNLY", "CD3E", "CD14",
                 "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
swne.embedding <- RunSWNE(se.obj, k = 10, var.genes = var.genes, genes.embed = genes.embed)
## [1] "1838 variable genes to use"
## Plot SWNE
PlotSWNE(swne.embedding, alpha.plot = 0.4, sample.groups = cell.clusters,
         do.label = T, label.size = 3.5, pt.size = 1.5, show.legend = F,
         seed = 42)

Now we’ll go through the SWNE embedding process step by step

First, let’s pull out the counts, scale and adjust gene variance, while keeping the scaled matrix nonnegative.

norm.counts <- ExtractNormCounts(se.obj, obj.type = "seurat", rescale = T, rescale.method = "log")
## calculating variance fit ... using gam
## Warning in pf(exp(df$res), n.obs, n.obs, lower.tail = F, log.p = F): NaNs
## produced
dim(norm.counts)
## [1] 13714  2638

We use the FindNumFactors function to identify the optimal number of factors to use. This function can be slow for large datasets, since it iterates over different values of k, so a simple “hack” is to just set k equal to the number of significant principal components.

k.range <- seq(2,16,2) ## Range of factors to iterate over
k.err <- FindNumFactors(norm.counts[var.genes,], k.range = k.range, n.cores = 8, do.plot = T)

We then run the NMF decomposition. We can initialize the NMF using either Independent Component Analysis (ica), Nonnegative SVD (nnsvd), or a completely random initialization. ICA is recommended for most datasets. The output of RunNMF is a list of the gene loadings (W) and NMF embedding (H).

k <- 10
nmf.res <- RunNMF(norm.counts[var.genes,], k = k, alpha = 0, init = "ica", n.cores = 8)
nmf.scores <- nmf.res$H

We can either use the pre-computed Shared Nearest Neighbors (SNN) matrix from Seurat or re-compute it ourselves.

# pc.scores <- t(GetCellEmbeddings(se.obj, reduction.type = "pca", dims.use = 1:k))
# snn <- CalcSNN(pc.scores)
snn <- se.obj@snn

Runs the SWNE embedding. The three key parameters are alpha.exp, snn.exp, and n_pull, which control how the factors and neighboring cells affect the cell coordinates.

alpha.exp <- 1.25 # Increase this > 1.0 to move the cells closer to the factors. Values > 2 start to distort the data.
snn.exp <- 1.0 # Lower this < 1.0 to move similar cells closer to each other
n_pull <- 3 # The number of factors pulling on each cell. Must be at least 3.
swne.embedding <- EmbedSWNE(nmf.scores, snn, alpha.exp = alpha.exp, snn.exp = snn.exp,
                            n_pull = n_pull, proj.method = "umap", dist.use = "cosine")

For now, let’s hide the factors by setting their names to the empty string "". We’ll interpret them later

swne.embedding$H.coords$name <- ""

To help with interpreting these cell clusters, let’s pick some key PBMC genes to embed.

genes.embed <- c("MS4A1", "GNLY", "CD3E", "CD14",
                 "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")

Since we only ran NMF on the overdispersed genes, we need to project the rest of the genes onto the NMF projection to get gene loadings for all genes.

nmf.res$W <- ProjectFeatures(norm.counts, nmf.scores, n.cores = 8)

Now we can embed the key PBMC genes onto the visualization and remake the plot

swne.embedding <- EmbedFeatures(swne.embedding, nmf.res$W, genes.embed, n_pull = 4)

Let’s make the SWNE plot with the key genes embedded. The closer a cell or a cluster is to a gene, the higher the expression level. We set a seed for reproducible cluster colors, so that every plot will use the same colors to label the clusters.

color.seed <- 42
PlotSWNE(swne.embedding, alpha.plot = 0.4, sample.groups = cell.clusters, do.label = T,
         label.size = 3.5, pt.size = 1.25, show.legend = F, seed = color.seed)

We can validate the embedded genes by overlaying the expression of one of these key genes onto the plot.

gene.use <- "CD8A"
gene.expr <- norm.counts[gene.use,]
FeaturePlotSWNE(swne.embedding, gene.expr, gene.use, alpha.plot = 0.4, label.size = 3.5, pt.size = 1.25)

We can also make a t-SNE plot for comparison.

tsne.scores <- GetCellEmbeddings(se.obj, reduction.type = "tsne")
PlotDims(tsne.scores, sample.groups = cell.clusters, pt.size = 1, label.size = 3.5, alpha = 0.4,
         show.legend = F, seed = color.seed, show.axes = F)

We can also interpret the factors by using the gene loadings matrix. Here, we extract the top 3 genes for each factor by gene loading. Since NMF tends to create a parts-based representation of the data, the factors often correspond to key biological processes or gene modules that explain the data.

gene.loadings <- nmf.res$W
top.factor.genes.df <- SummarizeAssocFeatures(gene.loadings, features.return = 3)
head(top.factor.genes.df)
##   assoc_score feature   factor
## 1   0.6141023    FTH1 factor_1
## 2   0.4461639     LTB factor_1
## 3   0.4051296     B2M factor_1
## 4   1.5032733  C4orf3 factor_2
## 5   0.1596317    FTH1 factor_2
## 6   0.1204116     FTL factor_2

And finally, we can make a heatmap to visualize the top factors for each gene

gene.loadings.heat <- gene.loadings[unique(top.factor.genes.df$feature),]
ggHeat(gene.loadings.heat, clustering = "col")

Extract cluster colors for compatibility with other plotting methods (i.e. Monocle)

color.mapping <- ExtractSWNEColors(swne.embedding, sample.groups = cell.clusters, seed = color.seed)
color.mapping
##       CD4 T cells           B cells   CD14+ Monocytes          NK cells 
##         "#7CAE00"         "#FF61CC"         "#C77CFF"         "#F8766D" 
##       CD8 T cells FCGR3A+ Monocytes   Dendritic cells    Megakaryocytes 
##         "#00BFC4"         "#00A9FF"         "#00BE67"         "#CD9600"